---
title: "Hypothesis 2"
description: "(a) The use of population terminology has increased in biomedical abstracts since 1990 and (b) The use of terms like genetic ancestry as well as national and continental terms has outpaced racial and ethnic terms in biomedical literature."
output: html_document
weight: 2
---
While our first set of analyses established that “diversity” is increasing modestly in the biomedical literature, we also wanted to account for how terminology related to population testing has changed over time. Fujimura and Rajagopalan (2011) have suggested that researchers now use biogeographic and genetic ancestry terms more often than race and ethnicity in their studies. Relatedly, Panofsky and Bliss (2017) have found that racial and ethnic terms (including those within the US Census and OMB Directive 15 labeling schema) have been replaced by more general biogeographic terms such as continental, national, and directional terminology. Below, we test how these sets of terms vary over time using computational text analysis.
To do this, we created a dictionary with various sets of terms corresponding to the existing literature. This dictionary includes a comprehensive (though not necessarily exhaustive) list of around 2,000 continental, subcontinental, national, directional, ancestry, and OMB/US Census terms. Additionally, we compiled a list of roughly 6,500 unique ethnic, tribal, and caste terms that we group under the umbrella category of “subnational.” In addition to an interactive tree diagram that visualizes these term sets, we have also included a searchable table to explore which terms are included in each category set. While future work will need to explore how these categories overlap and intertwine, the forthcoming analyses simply demonstrate how these term sets vary over time within the biomedical and diversity samples.
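The dictionary-based approach described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the three term sets, the toy abstracts, and the function names (`compile_term_sets`, `count_abstracts_by_year`) are all hypothetical, and the sketch assumes that an abstract counts once toward a category if it mentions at least one term from that set.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary; the real one holds 8,200+ terms.
TERM_SETS = {
    "continental": ["african", "asian", "european"],
    "national": ["chinese", "japanese", "american"],
    "ancestry": ["genetic ancestry", "ancestry"],
}

def compile_term_sets(term_sets):
    """Build one case-insensitive, whole-word regex per category."""
    patterns = {}
    for category, terms in term_sets.items():
        # Longest terms first so multi-word terms are tried before shorter ones.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in ordered)
        patterns[category] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns

def count_abstracts_by_year(abstracts, patterns):
    """Count, per year and category, the abstracts that mention at least one term."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, text in abstracts:
        for category, pattern in patterns.items():
            if pattern.search(text):
                counts[year][category] += 1
    return counts

# Toy abstracts, invented for illustration only.
abstracts = [
    (1990, "We studied hypertension in a Chinese cohort."),
    (1990, "A mouse model of cardiac fibrosis."),
    (2017, "Genetic ancestry and asthma risk in European and Asian samples."),
]

patterns = compile_term_sets(TERM_SETS)
counts = count_abstracts_by_year(abstracts, patterns)
print({year: dict(by_category) for year, by_category in counts.items()})
```

Dividing each yearly count by the number of abstracts published that year would give proportions rather than raw counts.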
In Figure 2A, the raw growth trends for the biomedical sample demonstrate that the use of population terminology has grown tremendously over time. In the top red line, we see that when all 8,200+ population terms are combined into one set, the count rises from just 166 in 1990 to more than 5,000 in 2017. As we can see in the orange and yellow lines, the majority of population terminology is a result of national and continental terms being used more often. While subnational, subcontinental, directional, racial/ethnic, and OMB/US Census terms do rise, these trends never top more than 650 instances in a given year. In fact, the terms in the ancestry set have never surpassed 50 instances per year throughout our sampling period.
Looking at these trends as proportions, Figure 2B again demonstrates that the overall growth of all population terms (20.06% in 1990 to 30.88% in 2017) is mostly the result of growth in national and continental terms, which increase by 9.52 percentage points (from 13.06% to 22.58%) and 5.84 percentage points (from 7.64% to 13.48%) respectively. On the other hand, the subcontinental, subnational, directional, ancestry, racial/ethnic, and OMB/US Census sets have seen almost no change over time. These results suggest that if the use of population terms is increasing over time, it again is doing so quite modestly and happens mostly through the use of national and continental terms. This is consistent with Panofsky and Bliss (2017) who, in a much smaller sample of publications from Nature Genetics, also find that national and continental terms are becoming the vernacular of choice among leading biomedical scholars when conducting population difference testing.
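The changes reported for Figure 2B are differences between two shares of abstracts, that is, percentage-point changes rather than relative growth rates. The short Python sketch below recomputes both quantities from the figures quoted above; the helper names `point_change` and `relative_change` are illustrative and not part of the original analysis.

```python
def point_change(start_pct, end_pct):
    """Difference between two shares, in percentage points."""
    return round(end_pct - start_pct, 2)

def relative_change(start_pct, end_pct):
    """Growth of the share relative to its starting value, in percent."""
    return round((end_pct - start_pct) / start_pct * 100, 1)

# Shares of abstracts in 1990 and 2017, as quoted in the text for Figure 2B.
shares = {
    "all population terms": (20.06, 30.88),
    "national": (13.06, 22.58),
    "continental": (7.64, 13.48),
}

for name, (start, end) in shares.items():
    print(f"{name}: {point_change(start, end)} points, "
          f"{relative_change(start, end)}% relative to 1990")
```

The two measures answer different questions: the percentage-point change shows how much of the whole sample shifted, while the relative change shows how much each term set grew compared with its own starting share.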
When we replicate these analyses on our diversity sample, Figure 2C again shows that the national and continental terms seem to be driving the majority of the population term usage in our results. In contrast with the biomedical sample, however, the trends show that OMB/US Census and racial/ethnic terms are used more often in the diversity sample.
Figure 2D shows that the overall use of population terms is indeed much higher in the diversity sample, occurring in 42.07% of articles compared to just 30.88% of the biomedical sample. This result is not surprising, as our sampling methodology inflates the totals of almost all of the category sets. For example, the OMB/US Census and racial/ethnic totals are nearly five to six times greater in this diversity sample. Despite these totals being higher, we do see that the use of population terms is declining from its peak of 60.9% in 1996. While the use of national terms has stayed consistent since the mid-1990s, all of the other term sets have steadily declined since 2003. In the case of the OMB/US Census and race/ethnicity categories, the proportion of abstracts that use these term sets has fallen by nearly half over the past 15 years.
Main Takeaway
At the beginning of this section, we hypothesized that (a) the use of population terminology has increased in biomedical abstracts since 1990 and (b) that the use of terms like genetic ancestry as well as national and continental terms has outpaced racial and ethnic terms in biomedical literature. Our results suggest that population terminology has indeed increased over time within the biomedical sample. However, in the diversity sample, population terminology grew only through the early 1990s before plateauing until the mid-2000s and declining thereafter. The majority of growth in these samples has been due to an increase in national and continental terminology, as most other term sets have either stayed relatively consistent or decreased since the mid-2000s. Thus, while there is strong support for growth in national and continental terms, we find no evidence that the use of genetic ancestry terms has replaced population labels relating to race and ethnicity.
Appendices
Here is a list of the 8,200+ terms that were collapsed into the “all population terms” category in these analyses. You can use the search box to find a specific term you are interested in. Please note that this dictionary is still in progress and the sub/categories are both fluid and imperfect. If you have suggestions about what to add and/or reclassify, please send an email to the authors using the email link in the top-right corner.
Here are the counts for top continental terms in the biomedical sample.
Here are the counts for top subcontinental terms in the biomedical sample.
Here are the counts for top national terms in the biomedical sample.
Here are the counts for top subnational terms in the biomedical sample.
Here are examples of how “Chinese” is used in context in the biomedical sample.
## # A tibble: 772 x 2
##    bigram                  n
##    <chr>               <int>
##  1 chinese biomedical   1002
##  2 the chinese           648
##  3 and chinese           456
##  4 chinese medicine      318
##  5 of chinese            290
##  6 traditional chinese   265
##  7 chinese herbal        208
##  8 in chinese            164
##  9 chinese biomedicine   147
## 10 chinese national      132
## # … with 762 more rows
Here are examples of how “American” is used in context in the biomedical sample.
## # A tibble: 1,151 x 2
##    bigram                   n
##    <chr>                <int>
##  1 the american          1114
##  2 american society       474
##  3 african american       390
##  4 african americans      310
##  5 american association   245
##  6 american academy       208
##  7 american college       187
##  8 american medical       160
##  9 of american            149
## 10 2016 american          129
## # … with 1,141 more rows
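Bigram tables like the two above can be approximated with a short Python sketch. The tables themselves are R tibble output, so this is an illustrative analogue rather than the code that generated them. The tokenizer, the toy sentences, and the function name `bigrams_with_term` are assumptions. Substring matching is used because the “American” table also lists “african americans”.

```python
import re
from collections import Counter

def bigrams_with_term(texts, term):
    """Count adjacent word pairs in which either word contains `term`."""
    term = term.lower()
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphanumeric runs; punctuation is dropped.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for left, right in zip(tokens, tokens[1:]):
            if term in left or term in right:
                counts[f"{left} {right}"] += 1
    return counts

# Toy sentences, invented for illustration only.
texts = [
    "Traditional Chinese medicine and Chinese herbal remedies.",
    "The Chinese cohort used traditional Chinese medicine.",
]

bigram_counts = bigrams_with_term(texts, "chinese")
for bigram, n in bigram_counts.most_common():
    print(bigram, n)
```

Inspecting these pairs shows which uses are population labels and which are not. In the tables above, for example, “chinese medicine” and “american society” suggest that some matches refer to medical traditions or professional organizations rather than study populations.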
Here is the correlation matrix for term sets in the biomedical sample.
Here is the correlation matrix for term sets in the diversity sample.
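A correlation matrix across term sets can be sketched as follows. The page does not state how the matrices were computed, so this sketch assumes Pearson correlations over per-abstract term counts; the numbers are toy values and the function names `pearson` and `correlation_matrix` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a series is constant
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy per-abstract counts for three term sets (assumed, not real data).
term_counts = {
    "national": [2, 0, 1, 3, 0],
    "continental": [1, 0, 1, 2, 0],
    "ancestry": [0, 1, 0, 0, 1],
}

matrix = correlation_matrix(term_counts)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Positive values indicate term sets that tend to appear in the same abstracts and negative values indicate sets that tend not to co-occur.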